1 Resume

In the case of informative sampling, the sampling scheme depends, explicitly or implicitly,
on the response variable. As a result, the sample distribution of the response variable
cannot be used directly for making inference about the population. In this research I
investigate the problem of informative sampling from a Bayesian perspective. The Bayesian
approach makes it possible to overcome problems that arise from the complexity of the
models used for handling informative sampling. The main objective of the research is to
combine elements of classical sampling theory and Bayesian analysis in order to identify
and estimate both the population model and the model describing the sampling mechanism.
Since inclusion probabilities are generally known, the population sums of squares of the
model residuals can be estimated with the techniques of sampling theory. I show how these
estimates can be incorporated into Bayesian modeling and how the Full Bayesian Significance
Test (FBST), which is based on a Bayesian measure of evidence for a precise null hypothesis,
can be used as a model identification tool. The results obtained by applying the proposed
approach to estimation and identification of the sample selection model seem promising. At
this point I am working on methods of estimation and identification of the population
model. An interesting extension of my approach is the incorporation of known population
characteristics into the estimation process. Some other directions for continuing my
research are highlighted in the sections that describe the proposed methodology.

2 Introduction 

Inference from a sample to the whole population usually involves estimating the model that
holds in the population from the sample measurements alone. In many practical situations
the sample selection probabilities depend directly on the values of the variable under
investigation (the outcome variable), even after conditioning on the covariates (auxiliary
variables). This typically occurs when the sample selection process uses design variables
that are correlated with the outcome variable. In this case the sampling design is
informative, and the distribution of the measurements for the units in the selected sample
differs from the distribution of the measurements for the units in the population. If the
sampling design is informative, standard estimation methods that ignore the sample
selection process may result in large biases. Hence, two of the most important problems
facing the analyst are selecting an appropriate model for the sample selection process and
testing the ignorability of that process. The approaches to handling informative sampling
proposed in the literature generally focus on estimation of unknown parameters, leaving
these two problems almost unaddressed. These methods typically use either classical
inference procedures, such as maximum likelihood estimation based on an approximation of
the model holding for the sample measurements (Pfeffermann et al. (1998), Pfeffermann and
Sverchkov (2003)), or weighting procedures, which use either the reciprocals of the sampling
probabilities (Binder (1983), Skinner et al. (1989), Pfeffermann (1993, 1996)) or
modifications of them designed to reduce the variances (Pfeffermann and Sverchkov (1999),
Beaumont (2008), Kim and Skinner (2013)). The sampling probabilities are generally assumed
to be known for the sampled units. The maximum likelihood approach requires modeling the
population distribution and specifying a parametric form for the conditional expectations
of the sample selection probabilities, given the outcome variable and the covariates.
Together, these two models define the model holding in the sample. In this case hypothesis
testing can be carried out with the likelihood ratio test statistic, the score test or the
Wald test; however, these tests are generally unstable or computationally complicated. In
the second case one can use the tests proposed by Pfeffermann and Sverchkov (1999) or by
Pfeffermann (1993).

In this article we apply a Bayesian approach to the estimation of unknown parameters and
to hypothesis testing. In principle, the likelihood based on the sample distribution can be
embedded within a Bayesian model, and inference can be drawn using MCMC techniques. The
problem with this approach is that prior information on the unknown parameters is generally
unavailable. On the other hand, if improper priors are imposed, the resulting posterior
distributions may likewise be improper. In addition, the resulting models are generally
very complicated, which leads to extremely slow convergence of the MCMC process even for
models containing a small number of unknown parameters. Another limitation of the sample
likelihood is that in some cases the model may be unidentifiable (see Section 3.1). As with
the methods based on the sample likelihood, we assume in this article a parametric model
for the population distribution, which describes the process generating the population
measurements, and a parametric form for the expectations of the sample selection
probabilities, given the outcome variable and the covariates. Assuming that the two
processes are independent and that the sampling probabilities are observed for all sampled
units, we propose to estimate the population model and the expectations of the sample
selection probabilities using the Horvitz-Thompson estimators of the population sums of
squares of the model residuals. The derivatives of the Horvitz-Thompson estimators with
respect to the unknown parameters provide the statistics for Bayesian inference. The
conditional distribution of these statistics, given the unknown parameters, can be
approximated by a multivariate normal distribution, since they are sums of random
variables. Imposing a flat multivariate normal prior on the vector of unknown parameters
permits the application of MCMC techniques and therefore estimation of the unknown
parameters. The main advantage of the Bayesian approach is that it permits solving the
problems of hypothesis testing and model identification. For hypothesis testing we use the
Full Bayesian Significance Test (FBST), introduced by Pereira and Stern (1999). The FBST is
based on a Bayesian measure of evidence in favor of the null hypothesis. This measure is
computed using draws from the posterior distribution: the draws are used to estimate the
posterior probability of the tangential set, that is, the set of points whose posterior
density is higher than the supremum of the posterior density under the null hypothesis.
The test rejects the null hypothesis if this probability is high. The other important
problem is identification of a model describing the sample selection process, since
statistical inference in this case typically requires modeling the expectations of the
sampling probabilities given the covariates and the outcomes, the forms of which are
seldom known (Pfeffermann et al. (1998), Pfeffermann and Sverchkov (1999)).

3 Summary of the Proposed Approaches 

3.1 Estimation Methods 

In this section we provide a brief description of the problem of informative sampling and
present the main findings in this area. Let $Y_i$ denote the value of the outcome variable
$Y$ associated with unit $i$ belonging to a sample $S$ drawn from a finite population $U$,
and let $X_i$ and $V_i$ denote the vectors of auxiliary variables associated with unit $i$,
such that $\dim(X_i) = p$ and $\dim(V_i) = q$. We assume that the population values are
independent realizations from a distribution with probability density function
$f_p(y_i \mid x_i; \theta)$, where $\theta$ is an unknown parameter, and that the sample
selection mechanism employs a design variable $Z$, so that the probability of a unit being
selected is proportional to the value of this variable (or to some function of it).
Throughout this paper we assume that the model holding in the population is the normal
linear regression model

$$y_i = x_i\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \tag{1}$$

where $\theta = (\beta, \sigma^2)$.

Let $\pi_i$ denote the sample inclusion probability of unit $i$, and let $I_i$ denote the
sampling indicator, which takes the value 1 if unit $i$ was selected to the sample and 0
otherwise. The distribution holding in the sample for unit $i$ is therefore the conditional
distribution $f_s(y_i \mid x_i, v_i) = f(y_i \mid x_i, v_i, I_i = 1)$, which, following
Pfeffermann et al. (1998), can be written as

$$f_s(y_i \mid x_i, v_i) = f_p(y_i \mid x_i)\,\frac{P(I_i = 1 \mid y_i, v_i)}{P(I_i = 1 \mid x_i, v_i)}, \tag{2}$$

where $P(I_i = 1 \mid x_i, v_i) = \int f_p(y_i \mid x_i)\, P(I_i = 1 \mid y_i, v_i)\, dy_i$. Noting that
$P(I_i = 1 \mid y_i, v_i) = \int P(I_i = 1 \mid y_i, v_i, \pi_i)\, f(\pi_i \mid y_i, v_i)\, d\pi_i = E_p(\pi_i \mid y_i, v_i)$,
relationship (2) can also be written in the form

$$f_s(y_i \mid x_i, v_i) = f_p(y_i \mid x_i)\,\frac{E_p(\pi_i \mid y_i, v_i)}{E_p(\pi_i \mid x_i, v_i)}, \tag{3}$$

where the index $p$ of the expectation indicates that it is computed with respect to the
population distribution $f_p(y_i \mid x_i)$. If the inclusion probabilities are proportional
to the values of the design variable $Z$, equation (2) takes the form

$$f_s(y_i \mid x_i, v_i) = f_p(y_i \mid x_i)\,\frac{E_p(Z_i \mid y_i, v_i)}{E_p(Z_i \mid x_i, v_i)}.$$

Note that the model holding in the sample, $f_s(y_i \mid x_i, v_i)$, is fully determined by
the population model $f_p(y_i \mid x_i)$ and by $E_p(\pi_i \mid y_i, x_i)$. Note also that if
the selection probabilities do not depend on the value of the outcome variable $y_i$, then
the distributions in the population and in the sample coincide. Pfeffermann et al. (1998)
show that, under common sampling designs with unequal sampling probabilities, when the
population measurements are independently drawn from some distribution, the sample
measurements are asymptotically independent as the population size increases. Using this
result, a sample likelihood can be specified as

$$L_{Samp} = \prod_{i=1}^{n} f_p(y_i \mid x_i; \theta)\,\frac{E_p(\pi_i \mid y_i, v_i; \gamma)}{E_p(\pi_i \mid x_i, v_i; \theta, \gamma)}, \tag{4}$$


where the unknown parameters $\theta$ and $\gamma$ index the population model and the model
underlying the sample selection mechanism, respectively. The authors note that the
functional form of the expectations $E_p(\pi_i \mid y_i, x_i)$ is not necessarily known and
therefore must be approximated. They propose two possible approximations:

$$E_p(\pi_i \mid y_i, v_i) \approx \sum_{j=0}^{J} A_j y_i^{j} + h(v_i), \tag{5}$$

where $h(v_i) = \sum_{p}\sum_{k} B_{kp}\, x_{ip}^{k}$ and $\{A_j\}$ and $\{B_{kp}\}$ are unknown
parameters, and

$$E_p(\pi_i \mid y_i, v_i) \approx \exp\Big[\sum_{j=0}^{J} A_j y_i^{j} + h(v_i)\Big]. \tag{6}$$

The authors point out that under approximation (6) the resulting sample distribution may
not be identifiable. For example, if the population distribution is normal,
$Y_i \mid x_i \sim N(\beta_0 + x_i'\beta, \sigma^2)$, and
$E_p(\pi_i \mid y_i, x_i) = \exp(A_1 y_i + g(x_i))$ for some function $g(x)$, then by (3) the
distribution of $Y_i \mid x_i$ holding in the sample is
$N((\beta_0 + A_1\sigma^2) + x_i'\beta, \sigma^2)$, in which $\beta_0$ and $A_1$ cannot be
identified separately. In order to avoid identification problems the authors propose to
split the estimation process into two steps: in the first step the parameters $\{A_j\}$ and
$\{B_{kp}\}$ are estimated from the observed probabilities $\pi_i$, and in the second step
the parameters indexing the population model are estimated from the likelihood (4), with
$\{A_j\}$ and $\{B_{kp}\}$ replaced by their estimates. Apart from solving the
identifiability problem under approximation (6), this approach is designed to ease the
computations, which can be cumbersome for both approximations (5) and (6) if the number of
parameters indexing the resulting sample model is large. In order to test sampling
ignorability one can use the likelihood ratio test statistic; however, this test is
applicable only if all the parameters $\theta$ and $\gamma$ are estimated from the
likelihood (4). If the parameters $\gamma$ are estimated from the observed probabilities
$\pi_i$, one can use a score or a Wald test. The described two-step procedure was applied by
Pfeffermann and Sverchkov (1999) to the estimation of the parameters of a linear regression
model with normal error terms. Noting that
$E_s(w_i \mid y_i, v_i) = 1 / E_p(\pi_i \mid y_i, v_i)$, where $w_i = 1/\pi_i$, the authors
propose at the first step to identify and estimate the unknown parameters $\{A_j\}$ and
$\{B_{kp}\}$ by regressing $w_i$ against $(x_i, y_i)$. If the assumption of normality is
dropped, the parameters $\beta$ can alternatively be estimated as the solution of

$$\sum_{i=1}^{n} \frac{w_i}{\hat{E}_s(w_i \mid x_i, v_i)}\,(y_i - x_i\beta)\, x_{si} = 0, \qquad s = 1, \ldots, p, \tag{7}$$

where $\hat{E}_s(w_i \mid x_i, v_i)$ denote the estimates of the expectations
$E_s(w_i \mid x_i, v_i)$, obtained by regressing $w_i$ against $(x_i, y_i)$ (see Pfeffermann
and Sverchkov, 1999, for details). For an extension of the approach to generalized linear
models see Pfeffermann and Sverchkov (2003). The approach based on (7) can be viewed as a
modification of probability weighting methods, which use census estimating equations for
deriving the unknown parameters $\theta$:

$$\sum_{i=1}^{N} h(y_i, x_i; \theta) = 0. \tag{8}$$

Since the measurements outside the sample are unobserved, the left-hand side of (8) is
replaced by its Horvitz-Thompson estimator:

$$\sum_{i=1}^{n} w_i\, h(y_i, x_i; \theta) = 0. \tag{9}$$

If $h(y_i, x_i; \theta) = \partial \log f_p(y_i \mid x_i; \theta) / \partial\theta$, the
estimation method is known in the literature as the pseudo-likelihood estimation method,
which was introduced by Skinner et al. (1989). Under model (1) the estimators $\hat\beta$
are the solutions of the equations

$$\sum_{i=1}^{n} w_i\,(y_i - x_i\beta)\, x_{si} = 0, \qquad s = 1, \ldots, p. \tag{10}$$
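Equations (10) are the normal equations of a probability-weighted least squares fit, so they can be solved in closed form. The following is a minimal Python sketch under that reading; the function name `weighted_ls_beta` and the toy data are illustrative assumptions and are not taken from the original study.

```python
import numpy as np

def weighted_ls_beta(X, y, w):
    """Solve sum_i w_i (y_i - x_i beta) x_si = 0 for beta, as in (10)."""
    Xw = X * w[:, None]                      # row x_i scaled by its weight w_i
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Toy sample (hypothetical numbers): intercept plus one covariate.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.gamma(1.0, 1.0, size=n)])
y = 3.5 + 0.8 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
pi = rng.uniform(0.05, 0.5, size=n)          # assumed known inclusion probabilities
w = 1.0 / pi                                 # sampling weights w_i = 1 / pi_i

beta_hat = weighted_ls_beta(X, y, w)
# Left-hand side of (10) at the solution; it should vanish numerically.
score = (w * (y - X @ beta_hat)) @ X
```

With equal weights the function reduces to ordinary least squares, which is the estimator that would be appropriate under an ignorable design.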

In order to test ignorability of the sampling mechanism, one can use the test proposed by
Pfeffermann and Sverchkov (1999). Their test employs the relationship

$$E_p(Y_i \mid v_i) = \frac{E_s(w_i Y_i \mid v_i)}{E_s(w_i \mid v_i)} \tag{11}$$

(see Skinner, 1994). Denoting by $e_i = y_i - E_p(Y_i \mid x_i)$ the regression residual
associated with unit $i$, one can test hypotheses of the form
$E_p(e_i^k) = E_s(e_i^k)$, $k = 1, 2, \ldots$, which by (11) is equivalent to testing the
hypotheses $Corr_s(e_i^k, w_i) = 0$, $k = 1, 2, \ldots$, where $Corr_s$ denotes the
correlation under the sample distribution. The authors point out that it generally suffices
to test the first 2-3 correlations. This approach can be applied to testing ignorability of
the sampling mechanism; however, it does not permit identification of the model underlying
the sample selection mechanism. There exist a few other approaches to testing ignorability
of the sampling mechanism. These are generally based on the difference between the
estimators of the regression coefficients under the assumed model and under an ignorable
sampling design. See Pfeffermann (1993) for discussion.


3.2 The FBST 

As noted above, the Bayesian approach permits application of the FBST to hypothesis
testing and model identification. This can be carried out by determining whether the fitted
model contains non-significant parameters. In our application this reduces to testing
nested models, where the more complex model is tested against the model under the null
hypothesis, obtained by setting the coefficients of some group of variables to zero. The
FBST can therefore be used as an identification tool for selecting, within a family of
nested models, the model that best fits the data.

Let us consider a standard parametric statistical model: for an integer $m$,
$\theta \in \Theta \subseteq R^m$ is the parameter, $g(\theta)$ is a prior probability
density over $\Theta$, $x$ is the observation (a scalar or a vector), and $L_x(\theta)$ is
the likelihood generated by the data $x$. After the data $x$ have been observed, the sole
relevant entity for the evaluation of the Bayesian evidence value, $ev$, is the posterior
probability density for $\theta$ given $x$, denoted by

$$g_x(\theta) = g(\theta \mid x) \propto g(\theta)\, L_x(\theta). \tag{12}$$

We restrict attention to the case where the posterior probability distribution over
$\Theta$ is absolutely continuous, that is, $g_x(\theta)$ is a density over $\Theta$. We
focus on testing sharp hypotheses, which state that $\theta$ belongs to a sub-manifold
$\Theta_H$ of smaller dimension than $\Theta$. For simplicity we write $H$ for $\Theta_H$ in
the sequel. Let $r(\theta)$ be a reference density on $\Theta$; the function
$s(\theta) = g_x(\theta)/r(\theta)$ is called the relative surprise. Now consider a sharp
hypothesis $H: \theta \in \Theta_H$. Let $s^* = \sup_{\theta \in H} s(\theta)$ and
$T = \{\theta \in \Theta : s(\theta) > s^*\}$. The Bayesian evidence value against $H$,
$\overline{ev}$, is defined as the posterior probability of the tangential set $T$, i.e.,

$$\overline{ev} = P(\theta \in T \mid x) = \int_T g_x(\theta)\, d\theta. \tag{13}$$

Note that the evidence value supporting $H$, $ev = 1 - \overline{ev}$, is not evidence
against the alternative hypothesis $A$. Equivalently, $\overline{ev}$ does not constitute
evidence in favor of $A$, although it is evidence against $H$. The FBST rejects $H$ whenever
$ev$ is small.
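The posterior probability of the tangential set $T$ in (13) can be estimated as the share of posterior draws whose density exceeds the supremum of the density over $H$. The Python sketch below illustrates this on a toy problem and is not the computation used in this research: it assumes a flat reference density $r(\theta) \equiv 1$, a bivariate normal posterior with hypothetical mean and covariance, and the sharp hypothesis $H: \theta_2 = 0$. For a normal posterior, comparing densities is equivalent to comparing squared Mahalanobis distances, and the minimum of that distance over $H$ is $m_2^2 / S_{22}$.

```python
import numpy as np

def mahalanobis_sq(theta, mean, cov_inv):
    """Squared Mahalanobis distance of each row of theta from mean."""
    d = theta - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def evidence_against_h(draws, mean, cov):
    """Monte Carlo estimate of P(theta in T | x) for H: theta_2 = 0."""
    cov_inv = np.linalg.inv(cov)
    # Highest posterior density on H corresponds to the smallest distance on H.
    d2_on_h = mean[1] ** 2 / cov[1, 1]
    d2 = mahalanobis_sq(draws, mean, cov_inv)
    # A draw lies in T when its density exceeds the supremum over H.
    return float(np.mean(d2 < d2_on_h))

rng = np.random.default_rng(0)
mean = np.array([1.0, 0.8])                  # hypothetical posterior mean
cov = np.array([[1.0, 0.3], [0.3, 0.5]])     # hypothetical posterior covariance
draws = rng.multivariate_normal(mean, cov, size=200_000)

ev_against = evidence_against_h(draws, mean, cov)
ev_support = 1.0 - ev_against
# Closed form for this toy case: the distance is chi-square with 2 df.
ev_against_exact = 1.0 - np.exp(-0.5 * mean[1] ** 2 / cov[1, 1])
```

In realistic models the supremum over $H$ has no closed form and has to be found by numerical maximization before the draws are compared with it.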

4 Proposed Approach 

4.1 Estimation Equations and Bayesian Model 

As noted by Pfeffermann et al. (1998), a situation that often occurs in practice is one
where the only design information available to the analyst is the vector of sample
inclusion probabilities of the sample units and, possibly, also the sample values of $Z$.
In this research we assume that we observe both the sample inclusion probabilities and the
values of the design variable, and that the sampling probabilities are proportional to the
corresponding values of the design variable. Let the population outcomes follow the model
defined by (1) and assume for simplicity that

$$E(\pi_i \mid v_i, y_i) = \gamma_v v_i + \gamma_y y_i. \tag{14}$$


We also assume that the population realizations of Z are independent between the units. 

Remark 1. The sample selection model defined above is a special case of model (5), where
$J = 1$ and $h(\cdot)$ is a linear function of the covariates. However, our approach can
easily be extended to situations where $J > 1$ and $h(\cdot)$ is a polynomial of order $m$,
for some $m > 1$. The method presented below can also be used if the conditional
expectation of the sample selection probabilities is given by (6).

In this research we distinguish between randomization-based inference over all possible
sample selections and inference based on both the randomization and the model
distributions. The former distribution underlies classical survey sampling inference, while
the latter also takes into account the process generating the values of the finite
population. The randomization distribution accounts only for the process of sample
selection, holding the population values fixed; inference based on this distribution is
therefore restricted to populations that are similar to the population under study (see
Pfeffermann, 2011, for discussion). On the other hand, inference based on both the
randomization and the model distributions requires strong assumptions on the design
variables. For example, in our case assuming a parametric form only for the conditional
expectation of $\pi_i$ given the values $y_i$ and $v_i$ would not be sufficient. In this
research we focus on randomization-based inference.

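Let $W(\gamma)$ denote the population sum of squares of the regression residuals of the
model for the design variable,

$$W(\gamma) = \sum_{i=1}^{N} \big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)^2, \tag{15}$$

where $\gamma = (\gamma_v, \gamma_y)$ and $\gamma_v = (\gamma_1, \ldots, \gamma_q)$. If all
population measurements were available, the value of $\gamma$ could be obtained as the
solution of the estimating equations

$$\frac{\partial W(\gamma)}{\partial \gamma_{v_s}} = \sum_{i=1}^{N} \big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, v_{si} = 0, \quad s = 1, \ldots, q; \qquad \frac{\partial W(\gamma)}{\partial \gamma_y} = \sum_{i=1}^{N} \big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, y_i = 0. \tag{16}$$

Here, and in the derivatives below, the constant factor $-2$ produced by differentiation is
omitted, since it does not affect the solution.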


In practice, statistical inference can only be based on the available sample measurements.
Denote by $\hat{W}(\gamma)$ the Horvitz-Thompson estimator of $W(\gamma)$ based on the
observed sample $S$. Then

$$\hat{W}(\gamma) = \sum_{i \in S} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)^2}{\pi_i}. \tag{17}$$
1Ti 


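Let

$$\frac{\partial \hat{W}(\gamma)}{\partial \gamma_{v_l}} = \sum_{i \in S} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, v_{li}}{\pi_i}, \qquad l = 1, \ldots, q,$$

and

$$\frac{\partial \hat{W}(\gamma)}{\partial \gamma_y} = \sum_{i \in S} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, y_i}{\pi_i}.$$

Note that the equations specified above can alternatively be expressed as

$$\frac{\partial \hat{W}(\gamma)}{\partial \gamma_{v_l}} = \sum_{i=1}^{N} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, v_{li}}{\pi_i}\, I_i, \qquad l = 1, \ldots, q,$$

and

$$\frac{\partial \hat{W}(\gamma)}{\partial \gamma_y} = \sum_{i=1}^{N} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, y_i}{\pi_i}\, I_i,$$

where $I_i$ denotes the sampling indicator. Note that in these equations the observed
values of the variables $Z$, $Y$ and $V$ are held fixed, and the only source of randomness
is the sampling indicators $I_1, \ldots, I_N$, which take the values 0 and 1.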
Denote

$$J = (J_1, \ldots, J_q, J_{q+1})' = \left(\frac{\partial \hat{W}(\gamma)}{\partial \gamma_{v_1}}, \ldots, \frac{\partial \hat{W}(\gamma)}{\partial \gamma_{v_q}}, \frac{\partial \hat{W}(\gamma)}{\partial \gamma_y}\right)'. \tag{18}$$

Note that, for $\gamma$ solving (16) and $l = 1, \ldots, q$,

$$E_D(J_l \mid \pi, y, v) = \sum_{i=1}^{N} \frac{\big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, v_{li}}{\pi_i}\, E_D(I_i) = \sum_{i=1}^{N} \big(\pi_i - (\gamma_v v_i + \gamma_y y_i)\big)\, v_{li} = 0,$$

because $E_D(I_i) = \pi_i$, and similarly for $J_{q+1}$ with $y_i$ in place of $v_{li}$, so that

$$E_D(J \mid \pi, y, v) = 0_{(q+1) \times 1}, \tag{19}$$

where $0_{(q+1) \times 1}$ denotes a vector of zeros of dimension $q + 1$, $E_D$ denotes the
expectation over all possible sample selections for fixed population values, and $\pi$,
$y$, $v$ denote the vector of inclusion probabilities and the vectors of the population
realizations of the variables $Y$ and $V$, respectively. Therefore, the following equations
can be specified:

$$J_l = 0 + \nu_l, \qquad l = 1, \ldots, q + 1, \tag{20}$$

where $\nu = (\nu_1, \nu_2, \ldots, \nu_{q+1})'$ is a $(q+1)$-variate random variable. The
implication of equations (20) is that even if $\gamma$ were known, it is unlikely that the
components of $J$ would be exactly zero for any selected sample $S$, due to sampling
variability, although we expect them to be close to zero. Noting that the $J_l$ are sums of
random variables, we assume that, given the vector of unknown parameters $\gamma$, the
random vector $\nu$ can be approximated by a $(q+1)$-variate normal distribution, that is,

$$\nu \mid \pi_{obs}, y_{obs}, v_{obs}, \gamma \approx N\big(0_{(q+1) \times 1}, \Sigma(\gamma)\big), \tag{21}$$

where $\pi_{obs}$, $y_{obs}$ and $v_{obs}$ denote the observed parts of the vectors $\pi$,
$y$ and $v$, respectively. A well-known result of classical sampling theory states that,
under the randomization distribution, the components of the matrix $\Sigma(\gamma)$ are
given by

$$\Sigma(\gamma)_{k,l} = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{e_i e_j v_{ki} v_{lj}}{\pi_i \pi_j}\, (\pi_{ij} - \pi_i \pi_j), \tag{22}$$

where $e_i = \pi_i - (\gamma_v v_i + \gamma_y y_i)$, $\pi_{ij}$ is the joint inclusion
probability of units $i$ and $j$, and $v_{q+1,i}$ is understood as $y_i$. Based on the sample
measurements, (22) can be estimated as

$$\hat{\Sigma}(\gamma)_{k,l} = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{e_i e_j v_{ki} v_{lj}}{\pi_i \pi_j}\, \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}}. \tag{23}$$
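The quantities in (17)-(23) can be illustrated with a short Python sketch. It is a simplified illustration rather than the procedure used in this research: it assumes Poisson sampling (independent sampling indicators), for which $\pi_{ij} = \pi_i \pi_j$ when $i \neq j$, so that only the diagonal terms of (23) remain; it also assumes that the covariate vector includes a constant, and all names and numbers are hypothetical. The simulation checks numerically that the design expectation of $J$ vanishes at the solution of the census equations (16).

```python
import numpy as np

def score_and_variance(pi, U, gamma):
    """Return J (derivatives of the HT estimator, factor -2 dropped) and the
    estimated design covariance of J, assuming Poisson sampling."""
    e = pi - U @ gamma                        # residuals e_i
    A = (e / pi)[:, None] * U                 # rows e_i * u_i / pi_i
    J = A.sum(axis=0)
    # Poisson sampling: pi_ij = pi_i * pi_j for i != j, so only i == j terms remain.
    Sigma_hat = (A * (1.0 - pi)[:, None]).T @ A
    return J, Sigma_hat

rng = np.random.default_rng(2)
N, n = 500, 50
v = rng.poisson(3.0, size=N).astype(float)
y = 3.5 - 0.1 * v + rng.normal(0.0, 1.2, size=N)
# Hypothetical positive design variable, linear in v and y.
z = np.clip(4.0 + 2.5 * v + 0.5 * y + rng.normal(0.0, 1.0, size=N), 0.5, None)
pi = n * z / z.sum()                          # inclusion probabilities proportional to z
U = np.column_stack([np.ones(N), v, y])       # assumption: covariates include a constant
gamma_pop = np.linalg.lstsq(U, pi, rcond=None)[0]   # solves the census equations (16)

R = 2000
J_draws = np.empty((R, U.shape[1]))
for r in range(R):
    s = rng.random(N) < pi                    # independent sampling indicators I_i
    J_draws[r], _ = score_and_variance(pi[s], U[s], gamma_pop)
J_mean = J_draws.mean(axis=0)                 # close to zero, as in (19)
```

For a fixed-size design with probabilities proportional to size, the joint inclusion probabilities do not factorize and the double sum in (23) has to be evaluated or approximated.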


Remark 2. Calculation of the components of the variance matrix $\Sigma(\gamma)$ generally
requires knowledge of the joint sampling probabilities for units $k$ and $l$, where
$k, l = 1, \ldots, N$. These probabilities are generally not specified; however, if the
sampling probabilities are proportional to a design variable, it can be shown that the
estimator (23) can be rewritten as


£(7) 


(n—1)5 tt 


k,l 


E n sr^n ejejV k iVij 
1=1 Z-^j = l 7Tj+7 Tj 


Sjt 71 


+ 


Sir 71 


where S n = 1 ^ J f 


If the sampling probabilities are proportional to some function of the design variable,
calculation of $\Sigma(\gamma)$ will require knowledge of the design variable value for all
sample units.


In order to apply a Bayesian approach we use model (21) and a noninformative prior on
$\gamma$, $h(\gamma)$. Then

$$f(\gamma \mid \pi_{obs}, y_{obs}, v_{obs}) \propto \phi_{0_{(q+1) \times 1},\, \Sigma(\gamma)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma)\, h(\gamma), \tag{24}$$

where $\phi_{0_{(q+1) \times 1},\, \Sigma(\gamma)}$ denotes the $(q+1)$-variate normal density
function with expectation vector $0_{(q+1) \times 1}$ and variance matrix $\Sigma(\gamma)$.

Remark 3. Application of the proposed method to situations where the sample selection
mechanism follows model (6) is straightforward. In my research I would like to extend the
proposed methodology to cases where the selection models have more complex forms.

The regression parameters $\beta$ indexing the population model can be estimated by the
same approach, by defining the population sum of squares of the residuals as
$U(\beta) = \sum_{i=1}^{N} (y_i - x_i\beta)^2$ and its Horvitz-Thompson estimator as

$$\hat{U}(\beta) = \sum_{i \in S} \frac{(y_i - x_i\beta)^2}{\pi_i}.$$


Remark 4. The methods of estimation of the unknown parameters $\beta$, hypothesis testing
and identification of the model holding in the population are currently being developed by
the author. The author is also working on extending the proposed methodology to situations
where the population model belongs to the family of generalized linear models (GLM).

Remark 5. An additional interesting extension of the proposed methodology is the
incorporation of constraints into the estimation process. This is also one of the
directions of my future work.

Remark 6. As previously mentioned, we assume a flat prior on all unknown parameters. It is
generally known that posterior distributions may be heavily influenced by the priors. In
this research I also intend to investigate and implement various methods of sensitivity
analysis.
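The posterior (24) can be explored with a random-walk Metropolis sampler. The Python sketch below is only illustrative and rests on several assumptions that are not part of this research: Poisson sampling, so that the variance estimator reduces to a single sum; a wide normal prior with standard deviation `tau`, standing in for the flat normal prior; a constant included among the covariates; and toy data. All function names are hypothetical. The proposal covariance is taken from the sandwich form $B^{-1} \hat{\Sigma} B^{-1}$, where $J(\gamma) = a - B\gamma$ is linear in $\gamma$.

```python
import numpy as np

def score_and_cov(gamma, pi, U):
    """J(gamma) and its Poisson-sampling covariance estimate (an assumption)."""
    e = pi - U @ gamma
    A = (e / pi)[:, None] * U
    return A.sum(axis=0), (A * (1.0 - pi)[:, None]).T @ A

def log_posterior(gamma, pi, U, tau=10.0):
    """Log of (24) up to a constant, with a wide N(0, tau^2 I) prior."""
    J, Sigma = score_and_cov(gamma, pi, U)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return -np.inf
    log_prior = -0.5 * gamma @ gamma / tau**2
    return -0.5 * logdet - 0.5 * J @ np.linalg.solve(Sigma, J) + log_prior

def metropolis(pi, U, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    d = U.shape[1]
    B = (U / pi[:, None]).T @ U               # J(gamma) = sum_i u_i - B gamma
    gamma_hat = np.linalg.solve(B, U.sum(axis=0))   # root of J(gamma) = 0
    _, Sigma = score_and_cov(gamma_hat, pi, U)
    B_inv = np.linalg.inv(B)
    C = B_inv @ Sigma @ B_inv                 # approximate posterior covariance
    L = np.linalg.cholesky(0.5 * (C + C.T)) * 2.4 / np.sqrt(d)
    draws = np.empty((n_iter, d))
    cur, cur_lp, accepted = gamma_hat, log_posterior(gamma_hat, pi, U), 0
    for t in range(n_iter):
        prop = cur + L @ rng.standard_normal(d)
        prop_lp = log_posterior(prop, pi, U)
        if np.log(rng.random()) < prop_lp - cur_lp:
            cur, cur_lp, accepted = prop, prop_lp, accepted + 1
        draws[t] = cur
    return draws, gamma_hat, C, accepted / n_iter

# Toy population and Poisson sample (hypothetical numbers).
rng = np.random.default_rng(3)
N, n = 500, 50
v = rng.poisson(3.0, size=N).astype(float)
y = 3.5 - 0.1 * v + rng.normal(0.0, 1.2, size=N)
z = np.clip(4.0 + 2.5 * v + 0.5 * y + rng.normal(0.0, 1.0, size=N), 0.5, None)
pi_pop = n * z / z.sum()
s = rng.random(N) < pi_pop
U_s = np.column_stack([np.ones(s.sum()), v[s], y[s]])
draws, gamma_hat, C, acc_rate = metropolis(pi_pop[s], U_s)
```

The returned `draws` play the role of the posterior sample used by the FBST; a sensitivity analysis of the kind mentioned in Remark 6 could, for instance, rerun the sampler for several values of `tau`.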

4.2 Application of the FBST 

In this section we consider a simple case of hypothesis testing under the model defined by
(14), $H: \gamma_y = 0$. Under $H$ the sample selection mechanism is ignorable, so rejection
of $H$ implies that the mechanism is informative. Extension to testing hypotheses of the
form $H: \tilde{\gamma}_y = 0$, where
$Span(\gamma \setminus \tilde{\gamma}_y) \subset Span(\gamma)$, for more complex sample
selection models is straightforward. Denote by $\gamma_{post}$ the random draws from the
posterior distribution $f(\gamma \mid \pi_{obs}, y_{obs}, v_{obs})$, obtained by applying
MCMC to the full model, and let $\dim(\gamma_{post}) = K \times (q + 1)$. Denote by
$\gamma_0$ the vector of unknown parameters indexing the sample selection model under $H$.
Let

$$\gamma_0 = \arg\max_{H}\ \phi_{0_{(q+1) \times 1},\, \Sigma(\gamma)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma)\, h(\gamma) \tag{25}$$

and

$$T = \{\gamma : f(\gamma \mid \pi_{obs}, y_{obs}, v_{obs}) > f(\gamma_0 \mid \pi_{obs}, y_{obs}, v_{obs})\}. \tag{26}$$

It follows (see Reference) then that derivation of the value $ev$ requires computation of
the probability $P(T \mid \pi_{obs}, y_{obs}, v_{obs})$. This probability can be estimated
using the posterior draws $\gamma_{post}$ as follows:

$$P(T \mid \pi_{obs}, y_{obs}, v_{obs}) \approx \frac{1}{K} \sum_{k=1}^{K} 1\left\{\frac{f(\gamma_k \mid \pi_{obs}, y_{obs}, v_{obs})}{f(\gamma_0 \mid \pi_{obs}, y_{obs}, v_{obs})} > 1\right\}, \tag{27}$$

where

$$\frac{f(\gamma_k \mid \pi_{obs}, y_{obs}, v_{obs})}{f(\gamma_0 \mid \pi_{obs}, y_{obs}, v_{obs})} = \frac{\phi_{0_{(q+1) \times 1},\, \Sigma(\gamma_k)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma_k)\, h(\gamma_k)}{\phi_{0_{q \times 1},\, \Sigma(\gamma_0)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma_0)\, h(\gamma_0)}\ \rho(\pi_{obs}, y_{obs}, v_{obs}) \tag{28}$$

and

$$\rho(\pi_{obs}, y_{obs}, v_{obs}) = \frac{\int \phi_{0_{(q+1) \times 1},\, \Sigma(\gamma)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma)\, h(\gamma)\, d\gamma}{\int \phi_{0_{q \times 1},\, \Sigma(\gamma_0)}(J \mid \pi_{obs}, y_{obs}, v_{obs}; \gamma_0)\, h(\gamma_0)\, d\gamma_0}. \tag{29}$$

Therefore, application of the FBST requires the following two steps:

1. A maximization step (25).

2. Computation of the ratio of the integrals defined in (29).


The first step is a standard maximization problem, which can usually be carried out with a
standard multivariate Newton-Raphson algorithm and therefore does not require intensive
computations. The second step involves more complex computations; however, it can be
carried out by importance sampling, using Monte Carlo techniques, as described by Zacks and
Stern (2003). The authors also derive the sample size required for attaining the desired
precision of the ratio of integrals.

Remark 7. One of the objectives of my research at this point is to approximate the value
of $\rho(\pi_{obs}, y_{obs}, v_{obs})$ in order to avoid computing the integrals in (29).
This will significantly simplify the calculation of the evidence value.


5 Simulation Study 

In order to test the proposed approach and to compare its performance with some of the
other approaches described in the literature review (Section 3.1), we performed a small
simulation study consisting of a few experiments. For each experiment we generated
$M = 200$ populations of size $N = 500$, and from each generated population we selected one
sample of size $n = 50$, such that the units were randomly selected with inclusion
probabilities proportional to the values of the design variable $Z$. We used the following
population and sample selection models:

$$Y_i = 3.5 + 0.8 x_i - 0.1 v_i + \epsilon_i, \qquad i = 1, \ldots, N, \tag{30}$$

where $\epsilon_i \sim N(0, 1.5)$ and the auxiliary variables $X$ and $V$ were generated from
Gamma(1, 1) and Poisson(3) distributions, respectively. The true model for the design
variable $Z$ is

$$Z_i = 4 + 2.5 v_i + 0.15 y_i^2 + \nu_i, \tag{31}$$

where $\nu_i \sim N(0, 2.5)$. The primary objective of the simulation study is to identify
the model holding for $E(Z_i \mid v_i, y_i)$ by testing the following hypotheses.

1. H : E(Zi\vi, yi) = 7q°^ + 7fV vs E{Zi\v u y t ) = 7 ^ + 7 ) + 7 

2. H : E(Zi\vi, yi) = 7 ^ + 7 '^Vi + 7 ^ 2 /i vs j/*) = 7 ^ + + 7 ^^ + 7 ^ 2 ) 2 /? 

3. H : E(Zi\v h = 7 ^ + 7 ^^ + 7 vs ^(Z^i, ^) = 7 ^ + 7 ( 3) Ui + 

For each experiment $j$, $j = 1, 2, 3$, and each sample $i$, $i = 1, \ldots, M$, we compute
the Bayesian evidence value in favour of $H$, $ev_{ji}$, as presented in Section 4.2, using
a flat normal prior. As mentioned above, the FBST rejects $H$ whenever $ev$ is small. In
order to define a rejection region, we use the asymptotic distribution of the evidence
value under $H$, provided by Pereira et al. (2008). The tables below summarize the
proportions of samples in which the hypothesis $H$ was rejected at various significance
levels. We report the results for the proposed test and for the classical likelihood ratio
(LR) test based on model (21). For the first experiment we also applied the test proposed
by Pfeffermann and Sverchkov (1999), which is based on the identity (11). As mentioned
above, this test can only be used for testing the informativeness of the sample selection
mechanism and is not applicable if both tested models are informative. We denote by PS(1)
and PS(2) the tests based on the correlations between the sampling weights and the first
and second powers of the residuals, respectively, as described at the end of Section 3.1.
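One replicate of the simulation design described above can be sketched in Python as follows. This is a hedged illustration, not the code used for the study. It assumes that the second parameters of the normal distributions in (30) and (31) are variances, that the design model is $Z_i = 4 + 2.5 v_i + 0.15 y_i^2$ plus noise, and that non-positive values of $Z$ are truncated to a small positive number, a safeguard introduced here. The successive sampling performed by NumPy's `choice` only approximates inclusion probabilities proportional to $Z$, and the function names are hypothetical.

```python
import numpy as np

def generate_population(N, rng):
    """Generate (x, v, y, z) under the assumed reading of (30) and (31)."""
    x = rng.gamma(shape=1.0, scale=1.0, size=N)
    v = rng.poisson(lam=3.0, size=N).astype(float)
    # Assumption: 1.5 and 2.5 are variances, not standard deviations.
    y = 3.5 + 0.8 * x - 0.1 * v + rng.normal(0.0, np.sqrt(1.5), size=N)
    z = 4.0 + 2.5 * v + 0.15 * y**2 + rng.normal(0.0, np.sqrt(2.5), size=N)
    return x, v, y, z

def draw_pps_sample(z, n, rng):
    """Draw n distinct units with selection probabilities proportional to z."""
    size = np.clip(z, 1e-6, None)            # assumption: guard against z <= 0
    p = size / size.sum()
    idx = rng.choice(z.size, size=n, replace=False, p=p)
    return idx, n * p                         # nominal inclusion probabilities

rng = np.random.default_rng(4)
x, v, y, z = generate_population(500, rng)
idx, pi_nominal = draw_pps_sample(z, 50, rng)
y_s, v_s, pi_s = y[idx], v[idx], pi_nominal[idx]   # observed sample values
```

Repeating these steps for each of the $M$ populations and applying the tests to each sample would give rejection proportions of the kind reported in the tables.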


Table 1: Proportions of samples in which H was rejected.

Significance          Experiment 1              Experiment 2      Experiment 3
level          FBST    LR      PS(1)   PS(2)    FBST    LR        FBST    LR
0.010          0.921   0.908   0.910   0.011    0       0.116     0.955   0.928
0.025          0.966   0.940   0.955   0.017    0.007   0.176     0.980   0.952
0.050          0.977   0.960   0.977   0.045    0.062   0.284     0.980   0.964
0.100          0.989   0.984   0.977   0.119    0.205   0.396     0.995   0.986


The table above shows that the FBST has very good power in the first and third
experiments. The high probability of rejecting H in the first experiment implies that our
method succeeds in revealing that the sampling mechanism is indeed informative (recall that
the true model for the sample selection process includes the quadratic term of the outcome
variable). In the third experiment both specified models are informative. In this case the
proposed method performs well in revealing the correct model. In the second experiment
neither model is correct. The results indicate that in this case both methods generally do
not reject the model with the smaller number of coefficients.

In order to validate the proposed methods, we carried out an additional experiment. For
this experiment the tested models are the same as in the first experiment; however, the
values of the design variable Z were generated under the model specified by the hypothesis
H of that experiment. We expect the proportion of samples in which H is rejected to be
close to the significance level.

Table 2: Empirical and theoretical distribution of the test statistics under model H.

Nominal level    FBST    LR      PS(1)   PS(2)
0.025            0.014   0.032   0.033   0.033
0.050            0.037   0.056   0.044   0.061
0.100            0.074   0.096   0.083   0.122
0.250            0.200   0.228   0.200   0.194
0.500            0.510   0.464   0.510   0.467
0.750            0.749   0.732   0.778   0.778
0.900            0.900   0.896   0.883   0.894
0.950            0.946   0.936   0.939   0.939


In general, the empirical distribution of all the statistics is sufficiently close to the 
nominal values, thus validating the use of all the discussed methods. 